
PLOS Digital Health

Public Library of Science (PLoS)

Preprints posted in the last 90 days, ranked by how well they match PLOS Digital Health's content profile, based on 91 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.

1
Who is leading medical AI? A systematic review and scientometric analysis of chest x-ray research

Vasquez-Venegas, C.; Chewcharat, A.; Kimera, R.; Kurtzman, N.; Leite, M.; Woite, N. L.; Muppidi, I. J.; Muppidi, R. J.; Liu, X.; Ong, E. P.; Pal, R.; Myers, C.; Salzman, S.; Patscheider, J. S.; John, T. R.; Rogers, M.; Samuel, M.; Santana-Guerrero, J. L.; Yaacob, S.; Gameiro, R. R.; Celi, L. A.

2026-04-07 health informatics 10.64898/2026.04.02.26349884 medRxiv
Top 0.1%
51.7%

Computer vision models for chest X-ray interpretation hold significant promise for global healthcare, but their clinical value depends on equitable development across diverse populations. We conducted a scientometric analysis to examine authorship patterns, geographic distribution, and dataset origins to assess potential disparities that could affect clinical applicability. We systematically reviewed literature on computer vision applications for chest X-rays published between 2017 and 2025 across multiple databases, including PubMed, Embase, and SciELO. Using Dimensions API and manual extraction, we analyzed 928 eligible studies, examining first and senior author affiliations, institutional contributions, dataset provenance, and collaboration patterns across different income classifications based on World Bank categories. High-income countries dominated research leadership, representing 55.6% of first authors and 59.7% of senior authors; no first authors were affiliated with low-income countries. China (16.93%) and the United States (16.72%) led in first authorship positions. Most datasets (73.6%) originated from high-income settings, with the United States being the largest contributor (40.45%). Private datasets were most frequently used (20.52%). Cross-income collaborations were rare, with only 3.9% of publications involving partnerships between high-income and lower-middle-income countries. Findings reveal substantial disparities in who shapes computer vision research on chest X-rays and which populations are represented in training data. These imbalances risk developing AI systems that perform inconsistently across diverse healthcare settings, potentially exacerbating healthcare inequities. Addressing these disparities requires coordinated efforts to develop globally representative datasets, establish equitable international collaborations, and implement policies that promote inclusive research practices.

2
CardioAI: An Explainable Machine Learning System for Cardiovascular Risk Prediction and Patient Retention in Nigerian Healthcare Settings

Gboh-Igbara, D. C.

2026-03-31 rehabilitation medicine and physical therapy 10.64898/2026.03.29.26349642 medRxiv
Top 0.1%
51.5%

Background: Cardiovascular disease is the leading cause of mortality in Nigeria and across sub-Saharan Africa, with rising incidence attributable to urbanisation, sedentary lifestyles, and limited access to early detection tools. Concurrently, patient dropout from rehabilitation programs remains a critical operational challenge for Nigerian clinics, with many patients failing to return after their initial consultation. Methods: We developed CardioAI, an Explainable Artificial Intelligence system comprising two predictive modules. The cardiovascular risk module trained four machine learning models - Logistic Regression, Random Forest, Gradient Boosting (XGBoost), and a Multilayer Perceptron - on a combined UCI Heart Disease dataset of 1,025 patient records. A novel Lifestyle Risk Index was engineered from five modifiable clinical markers. SHAP (SHapley Additive exPlanations) was applied for per-prediction feature attribution. The patient retention module trained three classifiers on a synthetic dataset of 800 records, modelling 10 operational and behavioural dropout factors. An NLP and OCR pipeline using Tesseract v5.5 and spaCy was implemented for clinical document processing. Results: The cardiovascular risk module achieved an AUC-ROC of 0.999 (XGBoost), 0.998 (Random Forest), 0.994 (MLP), and 0.927 (Logistic Regression) on the held-out test set. Cross-validated AUC with constrained tree depth was 0.97, confirming generalisation. SHAP analysis identified the Lifestyle Risk Index, ST depression, resting blood pressure, exercise-induced angina, and cholesterol as the five most influential predictors. The retention module achieved AUC-ROC of 0.66 (Logistic Regression), demonstrating the difficulty of dropout prediction with synthetic data. Conclusions: CardioAI demonstrates that explainable machine learning can provide clinically actionable cardiovascular risk assessment and patient retention intelligence in a low-resource Nigerian healthcare context.
The system is freely deployable, open-source, and designed for pilot validation in teaching hospitals across Lagos and Port Harcourt. Keywords: cardiovascular risk prediction, machine learning, explainable AI, SHAP, patient retention, clinical decision support, Nigeria, sub-Saharan Africa, XGBoost, random forest, digital health

3
Explainable machine learning for revisiting reported Irritable Bowel Syndrome correlates in a student cohort

Ramirez-Lopez, L.; Kang, P.

2026-04-15 gastroenterology 10.64898/2026.04.13.26350820 medRxiv
Top 0.1%
42.7%

Irritable Bowel Syndrome (IBS) affects a substantial proportion of university students, yet its associated factors remain incompletely characterised in South Asian populations. We reanalysed a publicly available dataset of 550 Bangladeshi students from Hasan et al. [1], conducting a data audit that identified implausible records, including males reporting menstrual symptoms, and reduced the analytic sample to 506 observations. Using Explainable Boosting Machines (EBMs), which capture non-linear effects and pairwise interactions without sacrificing interpretability, we found that psychological distress, elevated BMI and academic dissatisfaction were the strongest predictors of IBS (mean AUC = 0.852 across 100 stratified train-test splits). Critically, several findings diverged from the original logistic regression analysis. Physical activity showed a non-linear risk pattern only at high intensity, the association with gender was substantially weaker once metabolic and psychological factors were also accounted for, and malnourishment did not have as strong an impact as in the original study. These divergences likely arise because the machine-learning model captures non-linear effects and interactions that were not represented in the original regression specification. Our findings underscore the value of reanalysing existing datasets with methods suited to capturing complexity and highlight data quality verification as a necessary step in secondary analysis. Author summary: We reanalysed a dataset on Irritable Bowel Syndrome (IBS) among university students in Dhaka, Bangladesh. Before modelling, we audited the dataset, removed implausible records, and reconstructed the IBS classification from the Rome III questionnaire. We then applied an interpretable machine-learning model capable of modelling non-linear effects and interactions between variables.
Psychological distress (particularly anxiety and stress), body mass index, and dissatisfaction with academic major showed the strongest associations with IBS. The model also identified several interaction effects involving BMI. Our results differ in several respects from the original regression analysis, suggesting that modelling assumptions and data validation can influence the interpretation of IBS correlates. This study shows how explainable machine-learning models can complement conventional statistical analyses and how data validation can affect results in secondary analyses.

4
Performance optimization of an R Shiny-based digital health dashboard for monitoring small and sick newborn care in low-resource hospital settings

Thomas, J.; Jenkins, G.; Chen, J.; Ogero, M.; Malla, L.; Hirschhorn, L. R.; Richards-Kortum, R.; Oden, Z. M.; Bohne, C.; Wainaina, J.

2026-03-19 health systems and quality improvement 10.64898/2026.03.08.26347893 medRxiv
Top 0.1%
32.0%

Background: Digital health dashboards can enhance health system performance by transforming routinely collected data into actionable insights for decision-making. In low-resource settings, however, their effectiveness depends not only on the relevance of indicators but also on system reliability within constrained digital infrastructure. Neonatal mortality remains a major global health challenge, with the highest burden in low- and middle-income countries, where many deaths are preventable through timely, evidence-based interventions. Continuous monitoring of care processes and outcomes is therefore essential. To support this need, we developed the NEST360 Implementation Tracker (NEST-IT) using R Shiny to support quality improvement across more than 100 hospitals in sub-Saharan Africa. As the platform scaled to over half a million records and increasing concurrent users, performance constraints emerged, particularly in hospitals with limited computing resources, threatening timely access to critical information. Objective: This study aimed to describe optimization strategies applied to the NEST-IT dashboard and evaluate their impact before and after implementation. Methods: A structured optimization process was implemented following established R Shiny performance principles. Dashboard profiling was first conducted to identify key bottlenecks, after which targeted improvements were applied to improve efficiency and responsiveness. A quasi-experimental pre-post evaluation (December 2023-August 2024) assessed performance using three indicators: server processing time, visualization rendering time (VRT), and Time to First Byte (TTFB). Metrics were measured repeatedly during one-month baseline and post-optimization periods and summarized using mean values. Results: Four primary bottlenecks were identified: delayed server responses, slow visualization rendering, inefficient data handling, and inconsistent device performance.
Following optimization, interactive plot load time decreased from 10.1 to 2.7 ± 0.6 seconds (73.3% improvement). Visualization rendering improved from 3.61 to 1.62 seconds, while server processing time fell from 2.3 ± 0.7 to 0.8 ± 0.3 seconds. TTFB improved from 1.9 ± 0.4 to 0.6 ± 0.2 seconds, and system uptime increased from 92.5% to 99.2%. Conclusion: Performance optimization substantially improved dashboard responsiveness, enabling timely access to critical neonatal information in resource-constrained hospital settings. The findings provide a practical, evidence-based framework for improving the performance of R Shiny dashboards and demonstrate scalable strategies for delivering reliable digital decision-support tools in low-resource health systems.
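
The 73.3% figure quoted for interactive plot load time is consistent with a plain relative reduction, (before - after) / before. The sketch below applies that same formula to the other mean values in the abstract. The function name and the dictionary layout are illustrative assumptions, and the abstract itself only states the percentage for the first metric.

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Mean values (seconds) as quoted in the abstract: (baseline, post-optimization).
reported = {
    "interactive plot load time": (10.1, 2.7),
    "visualization rendering time": (3.61, 1.62),
    "server processing time": (2.3, 0.8),
    "time to first byte": (1.9, 0.6),
}

for name, (before, after) in reported.items():
    print(f"{name}: {percent_improvement(before, after):.1f}% reduction")
```

Only the first percentage can be checked against the source; the remaining values are derived here and are not reported by the authors.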

5
Human-supervised, large language model-based clinical decision support aligned to national newborn protocols in Kenya: a pragmatic, early-stage evaluation

Kuria, T.; Kamau, G.; Makokha, F.; Omondi, P.; Mbugua, G.; David, K.; Mbugua, S.; Gitaka, J.

2026-03-25 health informatics 10.64898/2026.03.22.26348994 medRxiv
Top 0.1%
31.9%

Introduction: Timely, protocol-adherent clinical decisions are crucial for reducing neonatal mortality in low-resource settings. Translating extensive national guidelines into bedside practice remains challenging. Objective: We developed and evaluated AIFYA, a human-supervised, large language model (LLM)-based clinical decision support system (CDSS) aligned with Kenya's national newborn care protocols. Methods: This prospective, mixed-methods, early-stage evaluation, guided by the DECIDE-AI framework, embedded AIFYA into routine workflows at two public health facilities (Level 5 and Level 4) in Bungoma County, Kenya, from September 2024 to June 2025. Primary outcomes were adoption, measured by cumulative neonatal cases managed; training reach, assessed by credentialed healthcare workers (HCWs); and guideline and citation concordance, evaluated through blinded review of 118 AI-generated recommendations by two neonatologists with adjudication by a third. Secondary outcomes included protocol adherence and triage-to-decision time. Results: A total of 50 HCWs were trained and 550 neonatal cases were managed over 10 months. Among surveyed HCWs (n = 33), 76% were female, with a mean age of 32.1 years. Expert review found 75% of recommendations were correct and 15% partially correct, with strong inter-rater reliability (weighted Cohen's kappa 0.85; 95% CI 0.79 to 0.91). Citation accuracy was 96%. In 40 complex dosing scenarios, 75% of outputs were rated correct. The median triage-to-decision time was 23 minutes (interquartile range 18 to 31). Implementation was supported by an offline-first architecture and a facility-based coaching model, sustaining engagement despite staff turnover. Conclusion: A human-supervised AI CDSS directly and transparently anchored to national clinical guidelines can be successfully implemented in routine low-resource neonatal care settings. The system demonstrated high user adoption and strong expert-rated concordance.
High citation accuracy builds clinical trust, ensuring safety and enabling auditable AI. These findings support progression to controlled multi-site trials to evaluate clinical effectiveness. Keywords: Neonatal care, Clinical decision support system, Large language model, Artificial intelligence, Human supervised, Low resource settings, Guideline adherence, Digital health, Kenya
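
The abstract reports a weighted Cohen's kappa for two reviewers rating recommendations on an ordered scale. As a rough illustration of how such a statistic is computed, here is a minimal pure-Python sketch. It assumes linear disagreement weights and a three-level scale; the abstract does not state the weighting scheme or the full set of rating categories, and the ratings below are invented toy data, not study data.

```python
from collections import Counter


def weighted_kappa(rater1, rater2, categories):
    """Linearly weighted Cohen's kappa for two raters on ordered categories."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two ordered categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater1)
    joint = Counter((index[a], index[b]) for a, b in zip(rater1, rater2))
    marg1 = Counter(index[a] for a in rater1)
    marg2 = Counter(index[b] for b in rater2)

    observed = 0.0  # weighted disagreement actually seen
    expected = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # assumed linear weights
            observed += weight * joint[(i, j)] / n
            expected += weight * (marg1[i] / n) * (marg2[j] / n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


# Hypothetical ratings: 0 = incorrect, 1 = partially correct, 2 = correct.
reviewer_a = [0, 0, 1, 1, 2, 2]
reviewer_b = [0, 1, 1, 1, 2, 0]
print(round(weighted_kappa(reviewer_a, reviewer_b, [0, 1, 2]), 3))
```

Quadratic weights are an equally common choice and would give a different value for the same ratings, so the weighting scheme matters when comparing against a reported kappa.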

6
PhysiCase: Development and dual-layer validation of synthetic cases for health professional education: A pilot study leveraging Generative AI

Komolafe, O. O.; Roberts, A. C.; Shelley, J.; Tawiah, A. K.

2026-06-09 rehabilitation medicine and physical therapy 10.64898/2026.06.07.26355114 medRxiv
Top 0.1%
27.0%

High-quality, domain-specific datasets are foundational to advancing educational tools and AI systems in healthcare, yet assembling case repositories from real-world clinical records faces substantial privacy, ethical, and licensing barriers. Synthetic data generation offers a compelling pathway forward, but educational cases require rigorous validation to ensure clinical plausibility and pedagogical utility. This pilot study introduces PhysiCase, a dual-layer validation pipeline for synthetic case generation and evaluates the feasibility of combining automated LLM-based screening with expert educator review. We generated 128 synthetic musculoskeletal (MSK) cases using four frontier large language models (GPT-4.1, GPT-4o, Google Gemini 2.5 Pro, and Llama 4 Scout) across 28 clinical conditions. Cases underwent automated quality screening using an "LLM-as-judge" framework (DeepEval) assessing prompt alignment, JSON correctness, answer relevance, bias, toxicity, and completeness. Ninety cases (70.3%) passed automated filtering and proceeded to expert evaluation by four MSK physiotherapy educators, who rated medical accuracy, realism, fidelity, relevance, and usability on 5-point Likert scales. GPT-4.1 demonstrated the highest automated pass rate (96%) and strongest expert ratings (medical accuracy 4.10/5, usability 4.38/5), while Llama 4 Scout showed the lowest pass rate (33.3%) and expert ratings. Expert-evaluated cases achieved strong content validity indices for usability (97.5%), relevance (97.5%), and realism (95%), though medical accuracy showed greater variance (CVI 87.5%). Cross-layer correlation analysis revealed that automated completeness metrics moderately aligned with expert usability ratings, while answer relevance and prompt alignment showed weak or negative correlations with clinical correctness. Qualitative analysis identified three primary failure modes: reductive logic, biomechanical inconsistency, and administrative/contextual gaps.
The dual-layer validation framework proved methodologically viable: automated screening efficiently reduced expert review burden, while human judgment remained indispensable for detecting subtle clinical reasoning failures. LLM-generated synthetic cases have the potential to meet practical educational needs for MSK physiotherapy, but expert validation is essential to safeguard clinical accuracy. These findings support a scalable division of labour for synthetic case development, with targeted improvements to prompting and automated reasoning checks needed to address identified "nuance gaps." The code for this paper is available at https://github.com/kwid-ai/PhysiCase

7
Co-creating data science solutions for maternal and child health decision-making in tribal primary health centres: an action research using the Three Co's Framework

Mitra, A.; Jayaraman, G.; Ondopu, B.; Malisetty, S. K.; Niranjan, R.; Shaik, S.; Soman, B.; Gaitonde, R.; Bhatnagar, T.; Niehaus, E.; K.S, S.; Roy, A.

2026-03-31 public and global health 10.64898/2026.03.29.26349643 medRxiv
Top 0.1%
25.9%

Background: Digital health tools are increasingly promoted for strengthening health information systems in low- and middle-income countries, yet routine maternal and child health (MCH) data in tribal primary health centres (PHCs) in India remains underutilised for local decision-making. Top-down digital tools often fail in low-resource settings because they are designed without meaningful input from end-users. Co-creation approaches for digital health in tribal and indigenous settings are largely unexplored. Methods: We conducted an action research study in three tribal PHCs under the Integrated Tribal Development Agency (ITDA), Rampachodavaram, Andhra Pradesh, India. We applied the Three Co's Framework (Co-Define, Co-Design, Co-Refine) to co-create data science solutions for MCH decision-making with five medical officers, 24 auxiliary nurse midwives, and 36 accredited social health activists across two action research cycles (August 2023 to August 2024). Co-creation involved collaborative indicator definition, data modelling, data quality validation, health facility catchment area construction, spatial analysis, and interactive dashboard development. Keller's Data Science Framework was employed using R to structure the analytical pipeline, and Data.org's Data Maturity Assessment (DMA) was used to assess organisational data maturity pre- and post-intervention. Findings: During Co-Define, co-creators identified a fundamental mismatch between system outputs (aggregate statistics for upward reporting) and their operational need for individual-level, geographically disaggregated, prospective information. 
Co-Design produced five interconnected data science solutions: (1) 42 co-defined MCH indicators grounded in clinical workflows; (2) a data model linking individuals, health services, providers, and facilities; (3) a data quality framework using the pointblank R package; (4) health facility catchment area boundaries constructed from scratch using medical officers' local knowledge, enabling spatial analysis that revealed significant clustering of ANC coverage and anaemia prevalence; and (5) an R Shiny dashboard integrating these solutions into an offline-capable interface with lifecycle-organised views and village-level navigation. The DMA showed moderate improvement in organisational data maturity from 5.04 to 5.75 out of 10, with the largest gain in Analysis (+1.90). Co-Refine continued beyond the formal study period, with two transferred medical officers maintaining analytical engagement from new postings. Interpretation: The Three Co's Framework, combined with a data science approach, provided a structured yet flexible method for co-creating locally relevant data science solutions in a tribal setting. The framework's explicit separation of problem definition from solution design was particularly valuable in a context where "the problem" is typically defined externally. Co-creation in tribal digital health settings is feasible and produces solutions that address locally articulated needs.

8
AI Adoption for NCDs in Kenya: A Qualitative Study

Rayo, J.; Cushny, W.; Mwangi, M.; Wanyee, S.; Linguraru, M. G.; Nyaga, N.; Koros, H.; Bosire, M.; Obuya, M.; Ngaruiya, C.

2026-05-27 public and global health 10.64898/2026.05.26.26354008 medRxiv
Top 0.1%
25.5%

Background: Non-communicable diseases (NCDs) represent a critical public health challenge in Kenya, responsible for over 50% of inpatient admissions and 40% of deaths. While digital health tools and artificial intelligence offer promising ways to improve prevention, diagnosis, and management, little is known about how these tools are perceived and used in practice. There is limited research exploring the views and lived experiences of young people in Kenya, who are a strategic priority for NCD prevention because behavioral risk factors are established in this window, and for Community Health Providers (CHPs) who provide health services within the community. This study aims to address this gap by examining the perspectives of the burden of non-communicable diseases and the potential role of digital health technologies, including artificial intelligence, for preventing and managing these conditions in these specific populations. Methods: A qualitative research design using focus group discussions (FGDs) was employed in Nairobi (urban) and Busia (rural) counties between March and July 2024. Eight FGDs were conducted with 60 participants purposively sampled from three stakeholder groups: community health promoters (CHPs), healthcare workers (HCWs), and youth aged 18-35 years. A semi-structured guide, co-developed with a Community Advisory Board, explored beliefs about NCDs, health-seeking behaviors, lifestyle practices, and attitudes toward digital health and AI. Audio recordings were transcribed verbatim, translated where necessary, and analyzed thematically using grounded theory principles on NVivo software (v12). Results: Six consolidated themes emerged: (1) understanding of NCDs and perceived risk; (2) barriers to NCD prevention and care; (3) the role of CHPs; (4) adoption of AI tools for NCD management; (5) trust, ethics and access concerns; and (6) community-driven recommendations for AI integration. 
Significant barriers including stigma, economic constraints, and barriers to care were documented alongside enthusiasm for AI tools among youth and CHPs in both urban and rural areas. Conclusion: This study shows that AI tools are being used for NCD prevention and management through spontaneous community adoption. However, it emphasizes the need for culturally relevant, equitable, and community-driven solutions. Effective scaling requires the identification and bridging of digital literacy gaps, the establishment of affordable infrastructure, the protection of data privacy, and the integration of artificial intelligence tools into existing community health frameworks. This process should involve the collaboration of trusted intermediaries, such as CHPs and community leaders, to ensure successful outcomes. Future initiatives should prioritize participatory design, policy frameworks for ethical governance, and targeted capacity building to enhance acceptance and sustainability of digital health innovations in low- and middle-income country settings.

9
Towards Integrated Digital Health Systems for Nutrition and Food Security in Uganda: A Cross-Sectional Survey

Samnani, A. A.; Kimbugwe, N.; Nduhuura, E.; Katarahweire, M.; Kanagwa, B.; Crowley, K.; Tierney, A.

2026-04-06 health systems and quality improvement 10.64898/2026.04.05.26350208 medRxiv
Top 0.1%
23.0%

Despite robust policy frameworks, Uganda's digital health landscape is characterised by fragmentation, often termed "Pilotitis", where stand-alone applications impede the integrated delivery of health, nutrition, and food security services. As part of the IGNITE project, this study mapped existing digital health systems (DHSs), identified systemic gaps, and explored opportunities and resource requirements for sustainable integration of existing health, nutrition, and food security data systems. The IGNITE project adopted a mixed-methods design; however, this paper reports findings from the first phase, a national cross-sectional survey conducted in Uganda. The survey mapped digital health, nutrition, and food security systems, identifying gaps, resource needs, and potential actions. Stakeholders from government, NGOs, academia, UN agencies, and frontline health workers were included using purposive and snowball sampling. Data were collected online and through field support. Of 134 respondents, 110 with ≥70% survey completion were included in the analysis. While 93% of respondents utilise digital tools (predominantly DHIS2 and mobile apps), only 20% reported full automated integration with national platforms. Critical barriers to interoperability included a lack of technical expertise (90%), insufficient DHIS2 training (82%), different data formats (77%), and infrastructure constraints (75%). Respondents identified workforce development (56%) and DHIS2 use and adoption (29%) as primary opportunities. Immediate priorities include staff training and provision of mobile hardware, while long-term strategies focus on standardised data formats (78%) and formalised governance frameworks for integrated platforms (64%) and automated data exchange (56%). Uganda possesses a vibrant but disconnected digital ecosystem.
Transitioning from isolated "data islands" to a cohesive system requires addressing the massive technical capacity gap and establishing mandated interoperability guidelines. The findings provide a data-driven roadmap for the Ministry of Health and partners to optimise digital health adoption, ensuring that nutrition and food security interventions are supported by a unified, evidence-informed digital architecture.

10
Technology acceptance of machine learning in life sciences: the role of hype perception and journal impact factor.

Serrano, A. E.

2026-06-09 health informatics 10.64898/2026.06.03.26354262 medRxiv
Top 0.1%
22.9%

Machine learning (ML) has emerged as a transformative technology across biomedical and life science sectors, with applications spanning drug discovery, medical imaging, genomics, and clinical decision support (Goecks et al., 2020; Patel et al., 2020). Despite exponential growth in ML-related publications, from fewer than 100 articles in 2003 to nearly 25,000 by 2021 (NCBI, 2022), adoption among industry professionals remains uneven and sector-dependent. Understanding what drives or inhibits this adoption is critical for organisations seeking to leverage ML capabilities in research and clinical practice. Technology adoption in organisational contexts has been extensively studied through the Technology Acceptance Model (TAM), originally proposed by Davis (1989) and subsequently extended to incorporate external variables influencing perceived usefulness (PU) and perceived ease of use (PEU) (Venkatesh & Davis, 1996). While TAM has been applied across multiple industries, its application within biomedical and life science contexts remains limited, and the industry-specific factors that shape ML acceptance in this sector have not been systematically examined. Two external variables are particularly relevant to life science professionals. First, the bibliometric journal impact factor (JIF) functions as a cognitive signal of scientific credibility, a sector where evidence-based decision-making is culturally embedded, and publication quality serves as a proxy for technological legitimacy (Garfield, 1996). Second, technology hype, operationalised through the Gartner Hype Cycle framework, represents a social influence variable that shapes organisational expectations and investment decisions around emerging technologies (Gartner Inc., 2018). Whether these variables influence ML acceptance among life science professionals, alongside individual knowledge and experience, has not been empirically tested. 
This study addresses that gap by investigating ML technology acceptance among 213 biomedical and life science professionals across EMEA, LATAM, and North America, using a cross-sectional quantitative survey and PLS-SEM analysis. The TAM model is extended with three external variables, JIF, technology hype, and prior knowledge and experience, to test their influence on PU and PEU in this specific professional context. Additionally, the study examines demographic and regional differences in ML acceptance, with particular attention to variation between academic researchers and healthcare professionals. The findings contribute a validated, sector-specific extension of TAM for life sciences, provide actionable insights for organisations seeking to accelerate ML implementation, and establish a framework for future subsector-specific research.

11
Recovering Clinical Detail in AI-Generated Responses for Low Back Pain Through Prompt Design

Basharat, A.; Hamza, O.; Rana, P.; Odonkor, C. A.; Chow, R.

2026-04-23 pain medicine 10.64898/2026.04.21.26351437 medRxiv
Top 0.1%
22.9%

Introduction: Large language models are increasingly being used in healthcare. In interventional pain medicine, clinical reasoning is essential for procedural planning. Prior studies show that simplified prompts reduce clinical detail in AI-generated responses. It remains unclear whether this reflects knowledge loss or simply prompt-driven suppression of information. Methods: We performed a controlled comparative study using 15 standardized low back pain questions representing common interventional pain questions. Each question was submitted to ChatGPT under three conditions: a professional-level prompt (DP), a fourth-grade reading-level prompt (D4), and clinician-directed rewriting of the D4 response to a medical level (U4→MD). No follow-up prompting was allowed. Three physicians independently rated responses for accuracy using a 0-2 ordinal scale. Clinical completeness was determined by consensus. Word count and Flesch-Kincaid Grade Level (FKGL) were also measured. Paired t-tests compared conditions. Results: Accuracy was highest with professional prompting (1.76). Accuracy declined with the fourth-grade prompt (1.33; p = 0.00086). When simplified responses were rewritten for clinicians, accuracy returned to baseline (1.76; p ≈ 1.00 vs DP). Clinical completeness followed the same pattern: DP 80.0%, D4 6.7%, U4→MD 73.3%. Fourth-grade responses were shorter and less complex. Upscaled responses were more complex and similar in length to professional responses. Inter-rater reliability was low (Fleiss κ = 0.17), but trends were consistent across conditions. Conclusions: Reduced clinical detail under simplified prompts appears to reflect constrained output rather than loss of knowledge. Clinician-directed reframing restores omitted content. LLM performance in interventional pain depends strongly on prompt design and intended audience.
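
For readers unfamiliar with the readability measure named in the methods, the Flesch-Kincaid Grade Level is a fixed formula over word, sentence, and syllable counts. The sketch below implements the standard formula from counts supplied by the caller. How the authors tokenised responses or counted syllables is not stated, so the counts in the example are purely hypothetical.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level from raw text counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a single response (not taken from the study).
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"FKGL: {grade:.2f}")
```

Shorter sentences and fewer syllables per word both lower the grade level, which matches the abstract's observation that fourth-grade responses were shorter and less complex.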

12
Comparison of foundation models and transfer learning strategies for diabetic retinopathy classification

Li, L. Y.; Lebiecka-Johansen, B.; Byberg, S.; Thambawita, V.; Hulman, A.

2026-04-20 health informatics 10.64898/2026.04.17.26351092 medRxiv
Top 0.1%
22.5%

Diabetic retinopathy (DR) is a leading cause of vision impairment, requiring accurate and scalable diagnostic tools. Foundation models are increasingly applied to clinical imaging, but concerns remain about their calibration. We evaluated DINOv3, RETFound, and VisionFM for DR classification using different transfer learning strategies in BRSET (n = 16,266) and mBRSET (n = 5,164). Models achieved high discrimination in binary classification (normal vs retinopathy) in BRSET (AUROC 0.90-0.98), with DINOv3 achieving the best performance under full fine-tuning (AUROC 0.98 [95% CI: 0.97-0.99]). External validation on mBRSET showed decreased performance for all models regardless of the fine-tuning strategy (AUROC 0.70-0.85), though fine-tuning improved performance. Foundation models achieved strong discrimination but poor calibration, generally overestimating DR risk. While the generalist model, DINOv3, benefited from deeper fine-tuning, miscalibration remained evident. These findings underscore the need to improve calibration and the comprehensive evaluation of foundation models, which are essential in clinical settings. Author summary: Artificial intelligence is increasingly being used to detect eye diseases such as diabetic retinopathy from retinal images. Recent advances have introduced "foundation models," which are trained on large datasets and can be adapted to new tasks. We aimed to evaluate how well these models perform in a clinical prediction context, with a focus not only on accuracy but also on how reliably they estimate disease risk. In this study, we compared different types of foundation models using two independent datasets from Brazil. We found that while these models were generally good at distinguishing between healthy and diseased eyes, their predicted risks were often poorly calibrated. In other words, the estimated probabilities did not consistently reflect the true likelihood of disease.
We also examined whether adapting the models to the target population could improve performance. Although this approach led to improvements, calibration issues remained. However, post-training correction improved the agreement between predicted risks and observed outcomes. Our findings highlight an important gap between model performance and clinical usefulness. We suggest that improving the reliability of risk estimates is essential before such systems can be safely used in healthcare.
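
The distinction between discrimination and calibration drawn in this abstract can be made concrete with a small sketch. The code below is not the authors' evaluation code; the function name `calibration_summary`, the choice of five equal-width bins, and the toy labels and risks are all assumptions made for illustration. It compares the mean predicted risk with the observed event rate and computes a binned expected calibration error (ECE), one common summary of miscalibration.

```python
def calibration_summary(y_true, y_prob, n_bins=5):
    """Return (mean predicted risk, observed event rate, binned ECE).

    y_true: list of 0/1 outcomes; y_prob: list of predicted risks in [0, 1].
    The equal-width binning is an illustrative choice, not taken from the abstract.
    """
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("y_true and y_prob must be non-empty and equally long")
    n = len(y_true)
    mean_pred = sum(y_prob) / n
    obs_rate = sum(y_true) / n

    # Group predictions into equal-width probability bins.
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    # ECE: size-weighted gap between mean predicted risk and observed rate per bin.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        bin_pred = sum(p for _, p in members) / len(members)
        bin_obs = sum(y for y, _ in members) / len(members)
        ece += len(members) / n * abs(bin_pred - bin_obs)
    return mean_pred, obs_rate, ece


# Hypothetical toy data: a model that ranks cases reasonably but overestimates risk.
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 1]
risks = [0.3, 0.4, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 0.9]
mean_pred, obs_rate, ece = calibration_summary(labels, risks)
print(f"mean predicted risk={mean_pred:.2f}, observed rate={obs_rate:.2f}, ECE={ece:.2f}")
```

A mean predicted risk well above the observed rate is the pattern the abstract describes as overestimating DR risk; a model can show it while still separating cases from non-cases well.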

13
Daily symptom monitoring is sustainable over months: retention, not compliance, is the primary barrier to long-duration digital tracking

Gunsilius, C. Z.; Pei, P.; Carayannopoulos, A.; Petzschner, F. H.

2026-06-10 rehabilitation medicine and physical therapy 10.64898/2026.06.08.26355180 medRxiv
Top 0.1%
22.4%

Ecological momentary assessment (EMA) enables real-time, longitudinal measurement of symptoms and behavior via smartphones, yet nearly all feasibility evidence comes from protocols lasting one to two weeks, far shorter than the timescales over which chronic diseases fluctuate and clinical decisions unfold. Whether daily compliance can be sustained over months, or whether it decays as short-protocol trends predict, is unknown. Here, 214 participants (173 with pain, 41 healthy controls) completed a 4-month (122-day) EMA protocol via the Soma smartphone app, generating 26,907 check-ins. Half the sample completed the full protocol without a two-week lapse. Aggregate compliance appeared moderate (50%), but this conflated two distinct phenomena: when recomputed over each participant's active period, compliance rose to 71%, with 91% achieving moderate-to-high adherence, and remained stable across all 17 study weeks. Pain status predicted earlier disengagement but not lower compliance among those who remained; after adjustment for differential retention, group differences disappeared. To our knowledge, this is the longest continuous daily EMA evaluation in a clinical population. It suggests the primary barrier to long-duration EMA is not declining motivation among active participants but concentrated early disengagement, with direct implications for the design of digital health protocols, decentralized trials, and remote symptom monitoring.
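
The gap between aggregate compliance and compliance over the active period can be illustrated with a short calculation. This is a sketch under stated assumptions, not the study's analysis: it assumes one prompt per day and defines the active period as enrolment through the last completed check-in, neither of which is specified in the abstract. The function name and the toy participant are hypothetical.

```python
def compliance_rates(checkin_days, protocol_days):
    """Return (protocol-wide compliance, active-period compliance).

    checkin_days: iterable of 1-based study days with a completed check-in.
    Assumes one prompt per day and an active period that ends at the
    participant's last check-in (both are illustrative assumptions).
    """
    days = set(checkin_days)
    if not days:
        return 0.0, 0.0
    completed = len(days)
    active_days = max(days)  # enrolment (day 1) through last check-in
    return completed / protocol_days, completed / active_days


# Hypothetical participant: responds on 5 of every 6 days, then drops out after day 59.
days = [d for d in range(1, 61) if d % 6 != 0]
protocol_rate, active_rate = compliance_rates(days, protocol_days=122)
print(f"protocol-wide: {protocol_rate:.0%}, while active: {active_rate:.0%}")
```

Dividing by the full 122-day protocol mixes dropout with day-to-day adherence; dividing by the active period isolates adherence among days when the participant was still engaged.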

14
Digital Health and Data Utilisation for Improved Primary Health Services Delivery: Multi-Site Perspectives from Quality Improvement Teams in Council Hospitals in Tanzania

Matimo, C. R.; Kacholi, G.; Mollel, H. A.

2026-04-17 health systems and quality improvement 10.64898/2026.04.10.26350674 medRxiv
Top 0.1%
22.0%

Background: Digital health plays an indispensable role in facilitating data analysis and use for enhancing healthcare delivery across health settings. However, there is scant information on the extent to which digital health influences the improvement of primary health services delivery through data use. This study examined the determinants that influence the use of digital health to improve health service delivery in council hospitals in Tanzania. Methods: A cross-sectional design was employed in six regions, involving 12 council hospitals. We used a self-administered questionnaire to collect data from 203 members of hospital quality improvement teams. Descriptive analysis was used to determine the frequency, proportion, and mean of responses, while bootstrapping analysis was conducted to test the statistical significance of the influence of digital health factors on data use for improving health service delivery. Results: Results show moderate agreement on data compatibility for planning and decision-making, with 40.4% of respondents agreeing it supports ordering commodities, 43.8% for staff allocation, and 38.4% for planning. However, dissatisfaction was higher for user-friendliness (47.8%), reliability (up to 65.5%), and usefulness (up to 63.5%). Overall, 50.2% (M=2.74±0.87) disagreed that digital systems effectively support data use. Structural model analysis confirmed significant positive influence of usefulness (β=0.199, p<0.001) and access to quality data (β=0.729, p<0.001) on data use, which strongly impacted service delivery (β=0.593, p<0.001), despite some factors showing no direct influence. Conclusion: The study finds that current digital health initiatives only modestly improve the user-friendliness, reliability, and usefulness of data systems, partly due to fragmented, non-interoperable platforms that burden data management. 
However, compatibility, usability, reliability, and usefulness of digital tools significantly enhance access to quality data and data-driven decisions. The study recommends strengthening and integrating existing systems and providing continuous digital health training to institutionalize data-informed decision-making.

15
Accuracy and Consistency of Frontier LLMs on Orthodontic Diagnostic Tasks: A Repeated-Trial Comparison

Kang, W. J.; Sim, J.; Loh, E. E. M.; Lim, A. C. Y.; Foong, K. W. C.

2026-05-20 health informatics 10.64898/2026.05.17.26353409 medRxiv
Top 0.1%
20.0%

Importance. Large language models are increasingly explored as clinical decision support tools in orthodontics, yet existing evaluations have been confined to knowledge-based question answering where reported accuracy ranges from 18% to 100%. No study has evaluated performance on the computational and classificatory tasks that define daily diagnostic work. Furthermore, 84.3% of published healthcare large language model studies fail to report the number of repeated queries performed, leaving output stochasticity unexamined. Objective. To compare the diagnostic accuracy and output consistency of three frontier reasoning-enhanced large language models, namely, ChatGPT 5.4 (Thinking), Gemini 3 (Thinking), and Claude Opus 4.6 (Extended Thinking), on Bolton analysis, Index of Orthodontic Treatment Need-Dental Health Component (IOTN DHC) classification, space analysis, and lateral cephalometric interpretation. Methods. In this comparative cross-sectional study with a repeated-measures design, each model, accessed through its respective consumer-facing web interface under default provider settings rather than through application programming interfaces, processed 200 purpose-built items (50 per task) across four independent trials, yielding 2,400 observations. Responses were scored against a pre-established reference standard by two independent raters using strict binary exact match criteria. Accuracy was reported with exact binomial 95% confidence intervals. Inter-model comparisons used Cochran's Q test with post-hoc McNemar's tests and Bonferroni correction. A supplementary context-rich prompting evaluation was conducted on 40 items (480 observations). Results. Claude Opus 4.6 (Extended Thinking) achieved the highest accuracy (99.0%; 95% CI: 96.4 to 99.9%), followed by Gemini 3 (Thinking) (95.5%; 91.6 to 98.1%) and ChatGPT 5.4 (Thinking) (94.0%; 89.8 to 96.9%) (Cochran's Q=6.87, p=0.032). 
Each model exhibited distinct, non-overlapping error profiles concentrated at the normal-abnormal classification boundary. An accuracy-consistency paradox emerged: the most accurate model was the least consistent (93.0%), while the least accurate was the second-most consistent (98.0%). Context-rich prompting eliminated all errors across all three models. Interpretation. Frontier reasoning large language models achieved high overall accuracy on orthodontic diagnostic tasks but retained concealed, task-specific vulnerabilities detectable only through repeated-trial evaluation. An accuracy-consistency paradox, in which the most accurate model was the least consistent, demonstrates that single-trial evaluations cannot characterise clinical risk. The reasoning modes were associated with high arithmetic accuracy but did not compensate for imprecise parametric knowledge on classification tasks; however, the absence of a non-thinking baseline means this association cannot be attributed to the thinking mode itself. Context-rich prompting eliminated all errors on synthetic data but should be regarded as a necessary yet insufficient prerequisite for clinical deployment pending prospective validation on real patient data.
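
The "exact binomial 95% confidence intervals" mentioned in the Methods are usually computed with the Clopper-Pearson method. The sketch below is a generic implementation of that method, not the authors' code; it inverts the binomial tail probabilities by bisection so that it needs only the standard library. The count of 198 correct out of 200 is an illustrative input chosen to give 99.0% accuracy; the abstract does not state which denominator was used for its intervals.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root(f, iters=60):
    """Bisection for an increasing function f on [0, 1] with f(0) < 0 < f(1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion x / n."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _root(lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _root(lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


# Illustrative count only: 198 correct responses out of 200.
low, high = clopper_pearson(198, 200)
print(f"accuracy={198 / 200:.1%}, 95% CI=({low:.3f}, {high:.3f})")
```

Because the interval is exact rather than based on a normal approximation, it stays inside [0, 1] and is asymmetric when accuracy is close to 100%, which is the regime these models operate in.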

16
Imbalance-Aware Optimal Transport Learning for Cost-Effective Diabetic Retinopathy Screening

Shi, M.; Afolabi, S. O.

2026-04-18 ophthalmology 10.64898/2026.04.16.26351035 medRxiv
Top 0.1%
19.7%

Background: Diabetic retinopathy (DR) is one of the leading causes of vision loss and blindness. AI models have been instrumental in providing an alternative to real-life medical treatment, which is costly and sometimes not readily available in developing and underdeveloped nations. However, most existing AI models are developed with high-quality clinical images, which makes them difficult to use in low-resource settings. This research therefore focuses on bridging this gap by developing a low-resource, mobile-friendly, and deployable deep learning (DL) model for the detection of DR using an imbalance-aware optimal transport (OT) learning approach. Methods: We trained our proposed framework with both high-quality hospital-grade images and low-resource smartphone-acquired images, and evaluated it on the original test set from the smartphone domain. We also curated three levels of smartphone image-degradation quality and reported results from multiple experiments with bootstrapping. All models were assessed using AUC, sensitivity, and specificity. Our results were compared with empirical risk minimization (ERM), Prototype OT, and Sinkhorn OT methods. Results: We used four strong backbone architectures in the assessment. With our framework, Mobilevit-s achieved the best performance: an AUC of 87%, sensitivity of 89%, and specificity of 95%. The 95% confidence intervals were approximately 84% to 89% for AUC, 81% to 96% for sensitivity, and 93% to 96% for specificity. This result indicated a performance increase of more than 3-5% compared to baseline methods. Conclusion: Our framework shows promising results for low-resource DR screening, with the potential to benefit less-advantaged groups and developing nations.
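
Bootstrapped confidence intervals for sensitivity and specificity, as reported in this abstract, can be sketched as follows. This is a generic percentile bootstrap over test cases, offered as an assumption about the general approach rather than a reproduction of the authors' procedure; the function names, the 2,000 resamples, the fixed seed, and the toy predictions are all hypothetical.

```python
import random


def sensitivity(y_true, y_pred):
    """True positive rate, or None when the sample has no positives."""
    positives = sum(y_true)
    if positives == 0:
        return None
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1) / positives


def specificity(y_true, y_pred):
    """True negative rate, or None when the sample has no negatives."""
    negatives = len(y_true) - sum(y_true)
    if negatives == 0:
        return None
    return sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0) / negatives


def bootstrap_ci(y_true, y_pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a metric, resampling cases with replacement."""
    if metric(y_true, y_pred) is None:
        raise ValueError("metric is undefined on the original sample")
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value is not None:  # skip rare resamples where the metric is undefined
            stats.append(value)
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


# Hypothetical test set: 20 DR cases (17 detected) and 80 normals (76 correctly cleared).
truth = [1] * 20 + [0] * 80
preds = [1] * 17 + [0] * 3 + [0] * 76 + [1] * 4
for name, metric in (("sensitivity", sensitivity), ("specificity", specificity)):
    low, high = bootstrap_ci(truth, preds, metric)
    print(f"{name}: {metric(truth, preds):.2f} (95% CI {low:.2f} to {high:.2f})")
```

With few positive cases, the sensitivity interval comes out much wider than the specificity interval, the same asymmetry visible in the ranges the abstract reports.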

17
Pixel-Based Skin Tone Estimation on Dermoscopy: A Dual-Rater MST Benchmark and Feasibility Study

Kumarasinghe, A.; Bui, V.; Ghanbarzadeh, R.

2026-05-17 health informatics 10.64898/2026.05.13.26353004 medRxiv
Top 0.1%
19.2%

Skin-tone labels are absent from public dermoscopy benchmarks such as the International Skin Imaging Collaboration (ISIC), making it impossible to audit whether clinical AI performs equitably across skin tones. While several recent works estimate skin tone automatically from clinical photography and selfies, we ask whether this approach is feasible on dermoscopy, the primary imaging modality of these benchmarks. To answer this, we make three main contributions. First, we release MST-Derm, a dual-rater Monk Skin Tone (MST) annotation benchmark on 500 ISIC 2018 images. Raters were given an explicit unrateable option for crops where the skin surrounding the lesion was too occluded to label confidently. We find that 60% of images were marked unrateable, yielding a 193-image consensus subset (quadratic-weighted Cohen's Kappa = 0.82). Second, we conduct a systematic feasibility study of three pixel-based MST annotation pipelines spanning the principal families in prior work: palette matching in perceptual colour space, robust colour statistics, and projection to a 1D colorimetric scalar. All three pipelines produce ordinal signal above chance (95% confidence intervals on quadratic-weighted Kappa exclude zero). However, ISIC 2018's extreme light-skin bias leaves 82% of the evaluation set at MST 2, giving a constant "always predict MST 2" baseline an accuracy floor the methods cannot overcome. To separate algorithmic signal from dataset bias, we evaluate on a class-balanced subset. The best method reaches quadratic-weighted Kappa = 0.43 against the trivial baseline of Kappa = 0.00, confirming the signal is genuine. Third, we diagnose this performance ceiling. We trace the bottleneck to two causes: dermoscopy's specialised illumination physically compresses the colour range on which lighter skin tones differ, and ISIC's dataset skew makes standard absolute-accuracy metrics uninformative. 
We conclude that while pixel-based colour features carry real MST signal on dermoscopy, current performance is insufficient for autonomous annotation. We release the benchmark, annotation protocol, all prediction runs, and analysis code to facilitate the development of robust skin-tone estimators, a vital prerequisite for accurately auditing fairness and mitigating bias in dermatological machine learning.
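
The abstract's point that a constant "always predict MST 2" baseline can score high accuracy yet zero Kappa follows directly from how quadratic-weighted Cohen's Kappa is defined. The sketch below is a plain implementation of the standard formula, not the released analysis code; labels are assumed to be 0-indexed class indices, and the toy label lists are hypothetical.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_classes):
    """Quadratic-weighted Cohen's Kappa for two lists of integer labels in [0, n_classes)."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("label lists must be non-empty and equally long")

    # Observed confusion matrix and its marginals.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    # Weighted observed disagreement vs. disagreement expected from the marginals alone.
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            num += weight * observed[i][j]
            den += weight * row_totals[i] * col_totals[j] / n
    if den == 0:
        return float("nan")  # undefined when both raters use one identical class
    return 1.0 - num / den


# Hypothetical skewed labels: most images sit in class index 1.
truth = [0, 1, 1, 1, 2, 1]
constant = [1] * len(truth)
accuracy = sum(t == p for t, p in zip(truth, constant)) / len(truth)
print(f"constant baseline: accuracy={accuracy:.2f}, "
      f"QWK={quadratic_weighted_kappa(truth, constant, 3):.2f}")
```

For a constant predictor the observed confusion matrix equals the matrix expected from the marginals, so the numerator and denominator coincide and Kappa is zero regardless of how skewed the labels are. This is why the class-balanced Kappa comparison in the abstract is informative where raw accuracy is not.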

18
Outcome Prediction Models for Critically Ill Patients Using Small Routine Laboratory Datasets

Cao, X.; Hou, J.; Wei, X.; Wang, Q.

2026-04-27 emergency medicine 10.64898/2026.04.26.26351758 medRxiv
Top 0.1%
18.8%

We present a suite of foundational outcome prediction models for critically ill patients, developed using readily available, routine blood tests and advanced machine learning techniques. The model inputs include complete blood counts (CBCs), metabolic panels, and additional biomarkers that assess liver and kidney function, coagulation status, and cardiac injury. The output is the predicted outcome at a given future horizon. For diagnoses, the length of the future horizon is set to zero, while for prognoses it is set to a fixed time interval. The training dataset in this study comprises clinical data from 332 ICU patients, augmented with 200 synthetic samples generated via a conditional diffusion model. Generative machine learning-based data imputation and augmentation approaches yielded modest gains in predictive accuracy. However, substantial performance improvements were achieved through additional methods, including dimensionality and order reduction, SHAP-based feature importance analysis, and a novel time-series-to-image encoding strategy that enables the use of image-based classifiers for temporal clinical data. Principal component analysis-based order reduction produced measurable gains in outcome prediction, while the time-series-to-image encoding proved particularly effective in mitigating small-data limitations common in clinical research. Across all evaluation metrics (accuracy, precision, recall, F1 score, and AUROC), the prognostic models achieved performance exceeding 85%, with some models attaining AUROC scores above 90%. We also introduce a new model-ensemble approach that improves the overall prediction, pushing all assessment metrics over 90%. 
This work establishes a robust and interpretable AI-enabled diagnostic and prognostic toolkit for outcome predictions in critically ill patients and demonstrates a scalable workflow for developing high-performing models from sparse healthcare datasets. The proposed framework is readily deployable in ICU environments with routine blood testing capabilities and serves as a foundation for future integration into digital twin systems for critical care.

19
ChatIBD: design, safeguards, and early international use of a guideline-grounded generative AI tool for inflammatory bowel disease (IBD) professionals

Chuah, C. S.; Gros, B.; Plevris, N.

2026-05-07 gastroenterology 10.64898/2026.05.06.26352526 medRxiv
Top 0.1%
18.7%

Objectives: To describe the design, operational safeguards, and early use of ChatIBD, a specialty-specific generative AI platform for inflammatory bowel disease (IBD), during its first 6 months of live deployment. Methods: ChatIBD is an online question-answering platform that uses retrieval-augmented generation over a curated corpus of IBD guidelines. Queries undergo hybrid semantic and keyword retrieval with query expansion and reranking, and the model is instructed to answer only from retrieved material and return linked citations. Safeguards include fixed medication dosing information from the European Medicines Agency (EMA), user feedback capture, and clinician review of flagged outputs. We performed a descriptive service evaluation of aggregated, de-identified platform metrics collected between 1 October 2025 and 1 April 2026. Results: During the study period, ChatIBD registered 913 users and processed 7,222 messages across 3,855 conversations. Activity was recorded across 69 countries and 28 languages, with the highest message volumes from the United Kingdom (27.1%) and Spain (12.3%). Median daily message volume was 35.5 (IQR 20 to 52), and 85.1% of messages were submitted on weekdays. Medication-related queries accounted for the largest use domain, while guideline synthesis was the most frequent inferred intent. Sixteen explicit feedback events were recorded, including one negative rating that triggered clinician review and system changes. Conclusions: ChatIBD showed early international uptake and repeat use as a specialty-specific, retrieval-grounded generative AI tool for IBD professionals. These findings support the feasibility of deploying a guideline-grounded clinical AI service with practical safeguards, but do not establish response accuracy, safety, or clinical effectiveness. Formal validation is in progress. 
What is already known on this topic: General-purpose large language models are increasingly being used informally by clinicians, but concerns remain about hallucinated content, unverifiable recommendations, and poor traceability to specialty-specific sources. What this study adds: This study describes the early deployment of ChatIBD, a specialty-specific retrieval-grounded generative AI tool for IBD professionals, and the safeguards used in its live operation. How this study might affect research, practice or policy: Early evaluations of live specialist AI tools may help guide governance, implementation, and validation. Uptake alone is not evidence of effectiveness, but it can help shape priorities for subsequent studies.
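
The abstract describes hybrid semantic and keyword retrieval but does not say how the two result lists are combined. One widely used option is reciprocal rank fusion (RRF), sketched below purely as an illustration; nothing here is taken from ChatIBD itself. The function name, the constant `k=60`, and the guideline passage identifiers are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores the sum of 1 / (k + rank) over the lists it appears in.
    RRF is an assumed fusion method here, not the one ChatIBD is known to use.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical passage ids returned by two retrievers for the same query.
semantic_hits = ["g3", "g1", "g2"]  # e.g. from an embedding index
keyword_hits = ["g1", "g4", "g3"]   # e.g. from a keyword index
print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
```

Passages found by both retrievers rise to the top, while passages found by only one are kept but ranked lower; a reranking stage like the one mentioned in the abstract would then reorder this fused shortlist.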

20
Rheumatic Heart Disease Detection in Asymptomatic Schoolchildren using ECG and PCG

Chuma, A. T.; Wang, C.; Voigt, J.-u.; Mekonnen, D.; Asmare, M. H.; Vanrumste, B.

2026-05-15 health informatics 10.64898/2026.05.12.26352939 medRxiv
Top 0.1%
17.6%

Rheumatic heart disease (RHD) remains a major public health concern across low- and middle-income countries in the Global South. Early detection through community-based screening of asymptomatic individuals has been identified as a critical strategy for reducing the disease burden. Despite this, the absence of accessible, automated population screening tools continues to impede implementation at scale. This study investigates the screening potential of integrating electrocardiography (ECG) and phonocardiography (PCG) for the early detection of RHD in asymptomatic schoolchildren. The dataset was obtained as part of an ambulatory screening initiative conducted across multiple school sites in rural areas of Ethiopia. It comprised ECG and PCG recordings from 611 asymptomatic schoolchildren aged 10 to 20 years. A comprehensive set of time-frequency, visibility graph, and non-linear features was extracted from both signal modalities. These features were subsequently evaluated using machine learning models to assess their utility in the automated screening of early RHD. With multimodal ECG and PCG signals, the best model achieved average 10-fold cross-validation sensitivity, positive predictive value, and F1-score of 59.6%, 63.6%, and 60.8%, respectively, whereas separate evaluation yielded an F1-score of 61.1% for ECG and 23.5% for PCG. Key features included the T-wave, the area under the QRS complex, and entropy measures derived from beat visibility graphs in the ECG. In addition, visibility graph features from multi-band S1 and S2 heart sound segments, along with MFCC coefficients from the PCG, were also relevant. However, PCG alone performed poorly and did not improve on the ECG features. Although auscultation is a key clinical diagnostic tool in symptomatic RHD, combining PCG with ECG features does not improve asymptomatic RHD detection over the ECG modality alone.
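
The three reported screening metrics are linked by a simple formula: F1 is the harmonic mean of sensitivity and positive predictive value. Applied to the averaged figures, the harmonic mean of 59.6% and 63.6% is about 61.5% rather than the reported 60.8%, a gap that would be expected if F1 was computed within each fold and then averaged. The sketch below illustrates that effect with hypothetical fold counts; it is not the authors' evaluation code, and the function names are assumptions.

```python
def screening_metrics(tp, fp, fn):
    """Return (sensitivity, positive predictive value, F1) from confusion counts."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * sens * ppv / (sens + ppv) if (sens + ppv) else 0.0
    return sens, ppv, f1


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical counts (tp, fp, fn) for two cross-validation folds.
folds = [(6, 2, 4), (3, 3, 7)]
per_fold = [screening_metrics(*counts) for counts in folds]

mean_sens = mean([m[0] for m in per_fold])
mean_ppv = mean([m[1] for m in per_fold])
mean_f1 = mean([m[2] for m in per_fold])              # F1 averaged over folds
f1_of_means = 2 * mean_sens * mean_ppv / (mean_sens + mean_ppv)

print(f"mean fold-wise F1={mean_f1:.3f}, F1 of averaged metrics={f1_of_means:.3f}")
```

The two numbers differ because the harmonic mean is non-linear, so averaged cross-validation metrics should be compared with other averaged metrics rather than recombined by formula.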